Finding canonical forms for historical German text

Author

Bryan Jurish

Abstract

Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any technique or system requiring reference to a fixed lexicon accessed by orthographic form. This paper presents two methods for mapping unknown historical text types to one or more synchronically active canonical types: conflation by phonetic form, and conflation by lemma instantiation heuristics. Implementation details and evaluation of both methods are provided for a corpus of historical German verse quotation evidence from the digital edition of the Deutsches Wörterbuch.
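The idea behind conflation by phonetic form can be illustrated with a minimal sketch: two spellings are treated as variants of each other when they map to the same phonetic key. The rewrite rules, the function names (`toy_phonetic_key`, `build_index`, `conflate`), and the tiny lexicon below are all hypothetical stand-ins chosen for illustration; they are not the phonetization component or data described in the paper.

```python
from collections import defaultdict

# Hypothetical hand-picked rewrite rules standing in for a real phonetizer.
# They are applied in order to the lowercased word.
TOY_RULES = [
    ("th", "t"),
    ("ey", "ei"),
    ("y", "i"),
    ("ß", "ss"),
    ("v", "f"),
]


def toy_phonetic_key(word):
    """Map a spelling to a crude phonetic key (illustrative assumption only)."""
    key = word.lower()
    for old, new in TOY_RULES:
        key = key.replace(old, new)
    # Collapse runs of the same letter, e.g. "ss" -> "s".
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def build_index(lexicon):
    """Group the known (canonical) word forms by their phonetic key."""
    index = defaultdict(set)
    for word in lexicon:
        index[toy_phonetic_key(word)].add(word)
    return index


def conflate(word, index):
    """Return all canonical forms sharing a phonetic key with `word`."""
    return index.get(toy_phonetic_key(word), set())


# Toy contemporary lexicon (assumed for this example).
LEXICON = ["Teil", "Tat", "sein"]
INDEX = build_index(LEXICON)

for historical in ["Theil", "That", "seyn"]:
    print(historical, "->", sorted(conflate(historical, INDEX)))
```

A historical form may map to zero, one, or several canonical forms, which matches the abstract's wording of "one or more synchronically active canonical types"; a realistic system would need a far richer phonetic model than these five rules.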

Similar resources

Comparing Canonicalizations of Historical German Text

Constructing a Canonicalized Corpus of Historical German by Text Alignment ---draft

More than Words: Using Token Context to Improve Canonicalization of Historical German

Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

This paper presents work on manual and semi-automatic normalization of historical language data. We first address the guidelines that we use for mapping historical to modern word forms. The guidelines distinguish between normalization (preferring forms close to the original) and modernization (preferring forms close to modern language). Average inter-annotator agreement is 88.38% on a set of da...

Text Screening (Censorship) in Iran: A Historical Perspective

Censorship has a long history in Iran that has interfered with text production, i.e., original writing as well as translation. This phenomenon seems to have marked the borderline between the government and the ‘enlightened’ intellectuals throughout history in Iran. Different governments have delineated ‘redlines’ for authors and translators and dealt with these constructors of culture based on ...

Publication date: 2008
